fix: date level stats #1312

Conversation
- read all stream.json files for a dataset
- get sum of stats for a given date from the manifest list in snapshot
- no need to calculate querier / ingestor stats separately
Walkthrough: The changes rename the function `fetch_daily_stats_from_ingestors` to `fetch_daily_stats` and aggregate date-level stats from all stream.json manifests instead of computing querier and ingestor stats separately.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant QueryHandler
    participant ClusterModule
    Client->>QueryHandler: HTTP GET /stats?date=YYYY-MM-DD
    QueryHandler->>QueryHandler: Parse and validate query parameters
    QueryHandler->>ClusterModule: fetch_daily_stats(date, stream_meta_list)
    ClusterModule-->>QueryHandler: Stats
    QueryHandler-->>Client: JSON response with stats or error
```
Actionable comments posted: 0

🔭 Outside diff range comments (1)

src/handlers/http/cluster/mod.rs (1)

396-421: 🛠️ Refactor suggestion — Use `NaiveDate` comparison to avoid repeated string allocations and potential locale issues

`time_lower_bound.date_naive().to_string() == date` converts every manifest's date into a heap-allocated `String` on every iteration and compares it to the user-supplied string. A cheaper and less error-prone approach is:

- Parse the incoming `&str` once into a `NaiveDate`.
- Compare the two `NaiveDate` values directly.

```diff
 pub fn fetch_daily_stats(
     date: &str,
     stream_meta_list: &[ObjectStoreFormat],
 ) -> Result<Stats, StreamError> {
-    // for the given date, get the stats from the ingestors
+    // For the given date, collect stats from all manifests
     let mut events_ingested = 0;
     let mut ingestion_size = 0;
     let mut storage_size = 0;
-    for meta in stream_meta_list.iter() {
-        for manifest in meta.snapshot.manifest_list.iter() {
-            if manifest.time_lower_bound.date_naive().to_string() == date {
+    let target_date = chrono::NaiveDate::parse_from_str(date, "%Y-%m-%d")
+        .map_err(|e| StreamError::Custom {
+            msg: format!("invalid date '{}': {e}", date),
+            status: http::StatusCode::BAD_REQUEST,
+        })?;
+
+    for meta in stream_meta_list {
+        for manifest in &meta.snapshot.manifest_list {
+            if manifest.time_lower_bound.date_naive() == target_date {
                 events_ingested += manifest.events_ingested;
                 ingestion_size += manifest.ingestion_size;
                 storage_size += manifest.storage_size;
             }
         }
     }
```

Benefits:

- Eliminates per-iteration `String` allocations.
- Guards against malformed dates in the query string.
- Avoids subtle bugs caused by locale-dependent `to_string()` formats.
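The parse-once idea can be exercised in a standalone sketch. `Date`, `parse_date`, and `count_matching` below are illustrative stand-ins (std-only, so it runs without the chrono crate), not the project's actual types:

```rust
// Minimal std-only stand-in for chrono::NaiveDate; fields are illustrative.
#[derive(PartialEq, Clone, Copy, Debug)]
struct Date {
    y: i32,
    m: u32,
    d: u32,
}

// Parse a "YYYY-MM-DD" string into a Date, rejecting malformed input.
fn parse_date(s: &str) -> Result<Date, String> {
    let parts: Vec<&str> = s.split('-').collect();
    if parts.len() != 3 {
        return Err(format!("invalid date '{s}'"));
    }
    Ok(Date {
        y: parts[0].parse().map_err(|e| format!("invalid year: {e}"))?,
        m: parts[1].parse().map_err(|e| format!("invalid month: {e}"))?,
        d: parts[2].parse().map_err(|e| format!("invalid day: {e}"))?,
    })
}

// Parse the user-supplied date once, then compare values directly —
// no per-manifest String allocation inside the loop.
fn count_matching(date: &str, manifest_dates: &[Date]) -> Result<usize, String> {
    let target = parse_date(date)?;
    Ok(manifest_dates.iter().filter(|d| **d == target).count())
}

fn main() {
    let dates = [
        Date { y: 2025, m: 5, d: 1 },
        Date { y: 2025, m: 5, d: 2 },
        Date { y: 2025, m: 5, d: 1 },
    ];
    assert_eq!(count_matching("2025-05-01", &dates).unwrap(), 2);
    assert!(count_matching("bogus", &dates).is_err());
    println!("ok");
}
```

With `chrono` available, `parse_date` would simply be `NaiveDate::parse_from_str(s, "%Y-%m-%d")`, and the comparison shape stays the same.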
🧹 Nitpick comments (3)

src/handlers/http/modal/query/querier_logstream.rs (1)

179-185: Redundant `total_stats` re-construction

`stats` already owns `events`, `ingestion`, and `storage`; wrapping it into a new `Stats` with identical fields is unnecessary:

```diff
-let total_stats = Stats {
-    events: stats.events,
-    ingestion: stats.ingestion,
-    storage: stats.storage,
-};
-let stats = serde_json::to_value(total_stats)?;
+let stats = serde_json::to_value(&stats)?;
```

Reduces one allocation and keeps the intent clearer.
src/prism/home/mod.rs (2)

221-224: Avoid cloning heavy metadata vectors in hot path

`stream_wise_meta.values().map(|meta| get_stream_stats_for_date(date.clone(), meta.clone()))` clones the full `Vec<ObjectStoreFormat>` for every stream/date pair, which may be large. You can pass a slice to avoid clones:

```diff
-.map(|meta| get_stream_stats_for_date(date.clone(), meta.clone()));
+.map(|meta| get_stream_stats_for_date(date.clone(), meta));
```

and adapt the callee to accept `&[ObjectStoreFormat]`. This reduces memory churn and speeds up the home-page stats aggregation.
249-252: Propagate "no-data" condition explicitly

If `fetch_daily_stats` returns zeroed stats because `meta` contained no manifests for that date, the home page silently shows zeros. Consider returning `None` for "no data" and filtering it out upstream to distinguish between "0 events" and "no data":

```rust
if stats.events == 0 && stats.ingestion == 0 && stats.storage == 0 {
    return Ok(None);
}
```

This yields a more accurate UI/UX.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)

- src/handlers/http/cluster/mod.rs (1 hunks)
- src/handlers/http/modal/query/querier_logstream.rs (3 hunks)
- src/prism/home/mod.rs (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: coverage
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (2)
src/handlers/http/modal/query/querier_logstream.rs (2)
167-177: Vector rename looks good

Pure renaming (`ingestor_stream_jsons` → `stream_jsons`) with no semantic impact. No concerns here.
39-43: ❓ Verification inconclusive — Possible double-counting after broadening the file filter

The previous predicate filtered only files starting with `.ingestor` to ensure each manifest is counted exactly once. After the change, the predicate accepts all `"stream.json"` files, e.g. those prefixed with `.ingestor`, `.querier`, `.indexer`, etc. If other components write their own `stream.json`, the same manifests may now be aggregated multiple times, inflating the daily stats.

Please confirm that only ingestors produce `stream.json` files, or reinstate a stricter predicate such as:

```diff
-Box::new(|file_name| file_name.ends_with("stream.json"))
+Box::new(|file_name| file_name.starts_with(".ingestor") && file_name.ends_with("stream.json"))
```

🏁 Script executed:

```bash
#!/bin/bash
# Locate all occurrences of the stream.json file filter
rg -n 'ends_with("stream.json")' .

# Show surrounding context for any Box::new predicates involving stream.json
rg -n -C5 'Box::new.*stream\.json' .
```

Length of output: 5013
Verify `stream.json` origins to avoid double-counting

I located six uses of the relaxed `ends_with("stream.json")` predicate across the codebase (e.g., in `src/query/...`, `src/prism/home/mod.rs`, `src/storage/object_storage.rs`, `src/enterprise/utils.rs`, and your handler `src/handlers/http/modal/query/querier_logstream.rs`). Unless every component's `stream.json` files are uniquely scoped by directory (e.g. only ingestors write them under a `.ingestor…/stream` path), you risk counting the same manifest multiple times.

Please confirm that no non-ingestor components emit their own `stream.json`; if they do, narrow the filter:

```diff
-Box::new(|file_name| file_name.ends_with("stream.json"))
+Box::new(|file_name| file_name.starts_with(".ingestor") && file_name.ends_with("stream.json"))
```

Files needing review:

- src/handlers/http/modal/query/querier_logstream.rs (line 163)
- src/query/mod.rs (line 475)
- src/query/stream_schema_provider.rs (line 511)
- src/prism/home/mod.rs (line 172)
- src/storage/object_storage.rs (line 680)
- src/enterprise/utils.rs (line 84)
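The relaxed and strict predicates can be compared in isolation; the file names below are illustrative, not taken from the repository:

```rust
// Strict filter suggested in the review: only ingestor-scoped stream.json
// files pass, so each manifest is counted once.
fn is_ingestor_stream_json(file_name: &str) -> bool {
    file_name.starts_with(".ingestor") && file_name.ends_with("stream.json")
}

// Relaxed filter currently in the PR: any stream.json passes.
fn is_any_stream_json(file_name: &str) -> bool {
    file_name.ends_with("stream.json")
}

fn main() {
    let files = [
        ".ingestor.0.stream.json",
        ".querier.stream.json",
        ".indexer.stream.json",
    ];
    let strict = files.iter().filter(|f| is_ingestor_stream_json(f)).count();
    let relaxed = files.iter().filter(|f| is_any_stream_json(f)).count();
    assert_eq!(strict, 1);  // only the ingestor file
    assert_eq!(relaxed, 3); // all stream.json files — potential double-counting
    println!("strict={strict} relaxed={relaxed}");
}
```

If other components do write `stream.json`, the relaxed filter aggregates every copy, which is exactly the inflation the review warns about.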
Actionable comments posted: 1

🧹 Nitpick comments (1)

src/handlers/http/modal/query/querier_logstream.rs (1)

146-147: Consider improving query string parameter validation

The current approach to splitting the query string is fragile and could panic if the query string doesn't follow the expected format (e.g., if it has multiple parameters or none).

Consider using a more robust parsing approach:

```diff
-let date_key = query_string.split('=').collect::<Vec<&str>>()[0];
-let date_value = query_string.split('=').collect::<Vec<&str>>()[1];
+let parts: Vec<&str> = query_string.split('=').collect();
+if parts.len() != 2 {
+    return Err(StreamError::Custom {
+        msg: "Invalid query parameter format".to_string(),
+        status: StatusCode::BAD_REQUEST,
+    });
+}
+let date_key = parts[0];
+let date_value = parts[1];
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/handlers/http/modal/query/querier_logstream.rs
(3 hunks)src/prism/home/mod.rs
(3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/prism/home/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: coverage
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
🔇 Additional comments (3)
src/handlers/http/modal/query/querier_logstream.rs (3)
39-39: Import refactoring reflects the unified stats approach

The import change from `fetch_daily_stats_from_ingestors` to `fetch_daily_stats` aligns with the PR objective of consolidating the way date-level statistics are calculated rather than handling querier and ingestor stats separately.
163-163: Simplified file filtering logic matches the consolidation approach

The filtering predicate has been simplified to include all files ending with "stream.json" rather than only those from ingestors. This change supports the PR objective of reading all stream files to aggregate stats instead of calculating them separately by component.
167-167: Renamed variable improves semantic clarity

The variable rename from `ingestor_stream_jsons` to `stream_jsons` better reflects that we're now collecting all stream JSON files, not just those from ingestors.

Also applies to: 176-176
fix applicable for:
- /api/v1/logstream//stats?date=yyyy-mm-dd
- /api/prism/v1/home